In this era of Big Data, it's entirely possible to have too much data. We might have so much data that our brains, or our computers, crash when we try to process it. What can we do when this happens?
The simplest answer is to just throw some of the data away. If we have a dataset $X$ with $n$ examples and $p$ features, we could discard some of the examples to get a more manageable dataset of size $m \times p$ (where $m < n$), or we could randomly pick some features and throw them away to yield a dataset of size $n \times q$ (where $q < p$).
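As a minimal sketch of both options using pandas (the DataFrame, m and q below are purely hypothetical stand-ins):

import numpy as np
import pandas as pd

#A hypothetical stand-in for X: n = 1000 examples, p = 10 features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size = (1000, 10)), columns = [f'feature{i}' for i in range(10)])

m, q = 200, 4 #Hypothetical reduced sizes
XFewerExamples = X.sample(n = m, random_state = 0) #Keep m of the n examples, giving an m x p dataset
keptFeatures = rng.choice(X.columns, size = q, replace = False)
XFewerFeatures = X[keptFeatures] #Keep q of the p features, giving an n x q dataset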
This approach, whilst quick and easy, is far from optimal. Just throwing information away usually makes our models worse and modelling complex relationships between features is hard enough already!
We still want to reduce the size of our dataset though, so consider the following approach: we sift through each feature in the dataset and look for ones we can get rid of. How do we decide if a feature is a suitable candidate to be discarded? Let's think of an example: suppose we were trying to use information about houses (size in $m^2$, distance to the nearest shop, number of bedrooms etc.) to predict house prices, and we noticed that all of the houses in our dataset were three-bedroom houses. The number of bedrooms would be an excellent candidate for removal because there is no variability in that feature across the dataset - it doesn't allow us to discriminate between different houses and so is of no use to us.
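As a rough sketch of that kind of check (the houses DataFrame below is a made-up stand-in for the scenario above), we could look at how many distinct values each feature takes, or at its variance:

import pandas as pd

#Made-up house data where every house has 3 bedrooms
houses = pd.DataFrame({'sizeM2': [75, 120, 95, 140],
                       'distToShopKm': [0.4, 1.2, 0.8, 2.5],
                       'numBedrooms': [3, 3, 3, 3]})

print(houses.nunique()) #numBedrooms only ever takes one value...
print(houses.var())     #...so its variance is zero - a natural candidate to discard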
Although it's an improvement on the random deletion approach, sifting through all of the features suffers from two main drawbacks: it quickly becomes impractical when there are many features to check, and it considers each feature in isolation, so it can't account for information that is shared between correlated features.
Principal Components Analysis (PCA) operates on a similar principle to the one described above. For an $n \times p$ dataset $X$, it finds a set of $p$ new directions - the Principal Components - ordered so that the first captures as much of the variance in the data as possible, the second captures as much of the remaining variance as possible (while being orthogonal to the first), and so on.
These $p$ Principal Components define a new coordinate system for the data. The key point to recognise is that the data are still the same; they're just being represented differently. If you've not thought about changing coordinate systems before, it might help to consider a real-world example where we use an alternative coordinate system:
One way of precisely defining the shade of a colour is to express it using the RGB system - although we're skipping over the finer details here, we express a colour as a vector $(x_1, x_2, x_3)$ where $x_1, x_2, x_3$ represent the intensities of red, green and blue, respectively. For example, pure red is $(255, 0, 0)$, pure blue is $(0, 0, 255)$ and yellow, an equal mix of red and green, is $(255, 255, 0)$.
Instead of using red, green and blue as the basis for describing a colour, some processes (most notably your printer) use an alternative basis: the respective intensities of cyan, magenta and yellow. Under this system we would represent pure yellow as $(0, 0, 255)$ - it's exactly the same colour as the $(255, 255, 0)$ of the RGB system, it just has a different representation under a different coordinate system.
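As a quick illustration (ignoring gamma correction, colour profiles and the black channel that real printers add), the conversion between the two representations is just a subtraction:

import numpy as np

rgbYellow = np.array([255, 255, 0]) #Pure yellow in the RGB system
cmyYellow = 255 - rgbYellow         #The same colour in (simplified) CMY: array([0, 0, 255])
print(cmyYellow)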
Changing coordinate systems in this way gives us a new dataset $X^*$ (still with dimensions $n \times p$). Going back to the house prices example above, our features no longer have convenient names such as size in $m^2$, but they do have the benefit that we know which ones have the most variance (and are therefore theoretically most likely to be useful in modelling the label, which is our goal). If we want to work with a smaller dataset, we can now just discard the $(k+1)^{th}$ to $p^{th}$ Principal Components and fit a model using the first $k$ Principal Components. In theory these $k$ new features capture more of the variance in the data than any $k$ of the original features, so they should give us a better model for the same computational expense.
Before we describe how we actually go about finding the Principal Components, let's generate the dataset we're going to transform and visualise it:
#Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0) #Fix the seed so the results below are reproducible

n = 200 #Number of observations
x1 = np.random.uniform(0,10,n) #X1 is uniform on [0, 10]
x2 = x1 + np.random.normal(0,2,n) #X2 is X1 plus Gaussian noise, so the two features are strongly correlated
X = pd.DataFrame({'X1':x1, 'X2':x2})
X.head()
plt.scatter(x1, x2)
plt.xlim(-5,15)
plt.ylim(-5,15)
plt.show()
This dataset only has two features, so ordinarily we wouldn't bother reducing its dimension - it's manageable enough already. However, for the purposes of illustration, a 2D dataset makes it easy to visualise what changes when we reduce it to 1D.
Now, it might not be clear exactly how we would go about finding the direction in the 2D plane with the most variation. To compare directions, we project the data onto each candidate direction. Let's say we have a point $(x_1, x_2)$ and the direction we want to project it onto is the line $l$ with equation $X_2 = m \times X_1 + c$. To project $(x_1, x_2)$ onto $l$, we draw the line through $(x_1, x_2)$ that is perpendicular to $l$ and mark the point where the two lines intersect. That intersection point is the projection of $(x_1, x_2)$ onto $l$.
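For the directions we'll use below, which all pass through the origin, this perpendicular projection has a simple closed form: if the line has direction vector $v$, the projection of a point $x$ onto it is

$$\text{proj}_v(x) = \frac{x \cdot v}{v \cdot v} \, v$$

which is exactly the formula implemented in the projectPointOntoLine function below.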
Below are a few examples of projecting our dataset $X$ onto different lines:
line1 = np.array([0,1]) #The line x = 0
line2 = np.array([np.sqrt(2), np.sqrt(2)]) #The line y = x
line3 = np.array([-np.sqrt(2),np.sqrt(2)]) #The line y = -x
def projectPointOntoLine(point, line):
    #Orthogonal projection of 'point' onto the line through the origin with direction vector 'line'
    #Dividing by np.dot(line, line) means the direction vector doesn't need to be normalised
    return (np.dot(point, line)/np.dot(line, line))*line
projLine1 = np.array([projectPointOntoLine(x, line1) for idx, x in X.iterrows()]) #Projection onto line 1
projLine2 = np.array([projectPointOntoLine(x, line2) for idx, x in X.iterrows()]) #Projection onto line 2
projLine3 = np.array([projectPointOntoLine(x, line3) for idx, x in X.iterrows()]) #Projection onto line 3
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15,6))
fig.suptitle('Projecting X onto different lines')
ax1.scatter(projLine1[:,0], projLine1[:,1])
ax1.set_xlim((-5,15)); ax1.set_ylim((-5,15))
ax1.title.set_text('The line x = 0')
ax2.scatter(projLine2[:,0], projLine2[:,1])
ax2.set_xlim((-5,15)); ax2.set_ylim((-5,15))
ax2.title.set_text('The line y = x')
ax3.scatter(projLine3[:,0], projLine3[:,1])
ax3.set_xlim((-5,15)); ax3.set_ylim((-5,15))
ax3.title.set_text('The line y = -x')
plt.show()
Immediately we can see that the points are more spread out when we project onto the line $y = x$ than if we project onto the line $y = -x$. This gives us a good indication that there is more variability in the direction given by the line $y = x$ than in the direction given by the line $y = -x$.
This is, of course, a very ad hoc and non-rigorous way of determining the direction in which the data exhibit the most variation.
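To make the comparison a little more concrete (a quick sketch reusing the direction vectors defined above), we can compute the sample variance of the data along each direction:

for name, line in [('x = 0', line1), ('y = x', line2), ('y = -x', line3)]:
    coords = X.to_numpy() @ (line / np.linalg.norm(line)) #Signed position of each point along the (unit) direction
    print(f'Sample variance along the line {name}: {np.var(coords, ddof = 1):.2f}')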
We won't go through a full derivation of how to find the Principal Components here - there are plenty of resources online where you can find one.
To obtain the Principal Components for a dataset $X$ of dimension $n \times p$ (with each feature mean-centred):
1. Compute the $p \times p$ covariance matrix $C = \frac{1}{n-1} X^T X$.
2. Compute the eigenvalues and eigenvectors of $C$.
3. Sort the eigenvectors by decreasing eigenvalue - the eigenvector with the largest eigenvalue is the first Principal Component, and so on.
4. To reduce the data to $k$ dimensions, stack the first $k$ eigenvectors as the columns of a $p \times k$ matrix $V_k$ and compute $X^* = X V_k$.
class PCA:
    def __init__(self, data):
        self.data = data
        self.dataArray = data.to_numpy() #Occasionally we'll want to multiply arrays together
        #PCA should be applied to mean-centred data, so subtract each column's mean
        self.dataCentred = self.dataArray - self.dataArray.mean(axis = 0)

    def computeCovMatrix(self):
        #Given an n x p input matrix, X, compute the associated covariance matrix,
        #defined as 1/(n-1) * transpose(X) * X for mean-centred X
        n = self.data.shape[0]
        self.covMat = 1/(n-1)*np.matmul(np.transpose(self.dataCentred), self.dataCentred)
        #Covariance matrix is a square matrix of size p x p

    def getEigenvalueDecomposition(self):
        #Input is the p x p cov matrix
        #We want to compute the eigenvalues and eigenvectors of the covariance matrix
        self.eigVals, self.eigVecs = np.linalg.eig(self.covMat) #eigVals is a vector of length p, eigVecs is a p x p matrix
        #The i-th element of eigVals is the eigenvalue associated with the i-th column of eigVecs

    def fitPCA(self, numComponents):
        #Obtain cov matrix and decompose it
        self.computeCovMatrix()
        self.getEigenvalueDecomposition()
        #Now we want to obtain the eigenvectors associated with the largest numComponents eigenvalues
        #There's a few ways to do this; one is to make a dictionary where the keys are eigenvalues and the
        #values are eigenvectors - that way we keep the link between each eigenvalue and its eigenvector, so we
        #can sort the eigenvalues and still know which eigenvector we need
        #(this assumes the eigenvalues are distinct, which is almost always the case for real data)
        self.eigDict = {}
        for idx, eigenValue in enumerate(self.eigVals):
            self.eigDict[eigenValue] = self.eigVecs[:,idx] #Store in a dictionary
        self.Vk = [] #This will become the p x k matrix made up of the first k eigenvectors
        for eigenValue in sorted(self.eigDict.keys(), reverse = True)[0:numComponents]:
            self.Vk.append(self.eigDict[eigenValue])
        self.Vk = np.array(self.Vk).T #Transpose so that the eigenvectors sit in the columns of a p x k array
        #Now that we have the first k eigenvectors stored in a new matrix, we can convert our (centred) dataset
        #into its dimension-reduced form
        self.dataReduced = np.matmul(self.dataCentred, self.Vk)
        #Finally let's tidy this up into a pandas dataframe and return it
        outDict = {}
        for i in range(numComponents):
            outDict[f'PC{i+1}'] = self.dataReduced[:,i]
        outDF = pd.DataFrame(outDict)
        return outDF
myPCA = PCA(X)
XStar = myPCA.fitPCA(2)
XStar.head()
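As a rough sanity check (assuming scikit-learn is available - it isn't needed anywhere else in this post), we can compare our output against sklearn's implementation. The two should agree up to the sign of each component, which is arbitrary:

from sklearn.decomposition import PCA as SklearnPCA #Alias the import so it doesn't clash with our own PCA class

XStarSklearn = SklearnPCA(n_components = 2).fit_transform(X)
#Each Principal Component is only defined up to a sign flip, so compare absolute values
print(np.allclose(np.abs(XStar.to_numpy()), np.abs(XStarSklearn)))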
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,10))
ax1.scatter(XStar['PC1'], XStar['PC2'])
ax1.set_xlim((-15,15))
ax1.set_ylim((-15,15))
ax1.set(xlabel='First Principal Component', ylabel='Second Principal Component', title = 'Principal Components')
#Also want to plot the original data with the first and second pc running through
#First PC
principalComponent = myPCA.Vk[:,0]
pcDirection = principalComponent[1]/principalComponent[0] #Slope of the first Principal Component direction
#Second PC
principalComponent2 = myPCA.Vk[:,1]
pcDirection2 = principalComponent2[1]/principalComponent2[0] #Slope of the second Principal Component direction
ax2.scatter(X['X1'], X['X2'])
ax2.set_xlim((-5,15))
ax2.set_ylim([-5,15])
ax2.set(xlabel='X1', ylabel='X2', title = 'Original Data')
#Plot both directions through the data - the intercepts (0 and 10) are chosen so the lines pass roughly through the centre of the point cloud
ax2.plot(np.linspace(-5,15), pcDirection*np.linspace(-5,15), c = 'g', linewidth = 6, label = '1st Principal Component')
ax2.plot(np.linspace(-5,15), 10 + pcDirection2*np.linspace(-5,15), c = 'r', linewidth = 6, label = '2nd Principal Component')
ax2.legend()
plt.show()
We can see that the x axis of the left plot, which corresponds to the first Principal Component, exhibits much more variation than the y axis, which corresponds to the second Principal Component.
On the right-hand plot we've overlaid the original dataset with the directions corresponding to the two Principal Components; the direction of greatest variance is the one we might intuitively have expected.
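As a final check on how much information we'd retain if we dropped the second Principal Component, we can look at the proportion of the total variance each component accounts for - a quick sketch using the eigenvalues our class already stores (each eigenvalue equals the variance of the data along the corresponding Principal Component):

explainedVariance = myPCA.eigVals / myPCA.eigVals.sum()
for i, proportion in enumerate(sorted(explainedVariance, reverse = True)):
    print(f'PC{i+1} accounts for {proportion:.1%} of the total variance')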